This report will discuss the developments since last semester on using code-based open-source software for teaching interactive data visualisation. The insights gained from developing an exemplar of interactive techniques applied to data analysis, will form the basis of discussion. Implications for teaching, such as the choice of techniques, software and datasets will then be discussed.
A narrative on how ideas for the exemplar emerged and changed over its development will be outlined and examined with respect to the insights gained in the process.
The motivation to develop an interactive exemplar for cluster analysis emerged from reading literature on the topic, as well as finding “real” data for which it was of interest to see if there was any such underlying structure. Data on the performance of New Zealand schools in the National Certificate of Educational Achievement (NCEA), at the four qualification levels, Level One, Two, Three (L1, L2, L3) and University Entrance (UE), were obtained from the New Zealand Qualifications Authority (NZQA) website. Information on the school decile, region and a “small” cohort warning, were also provided in the data. The decile rating is a measure of the general income level of the families of students attending the school. The socio-economic background of students increases as the decile increase from one to ten. A handful of schools have a decile rating of zero, due to unique circumstances that make them exempt from the socio-economic measure. The achievement rate of a school for each qualification level was quantified in a few ways. The achievement indicator chosen for this analysis was the proportion of students at the school who were successful in obtaining the qualification level, given that they were entered in enough standards to have the opportunity to earn the qualification in the 2016 school year. This is referred to as the “Current Year Achievement Rate” for the “Participating Cohort” by the NZQA (see http://www.nzqa.govt.nz/assets/Studying-in-NZ/Secondary-school-and-NCEA/stats-reports/NZQA-Secondary-Statistics-Consolidated-Data-Files-Short-Guide.pdf).
Only schools with achievement indicators across all four qualification levels were retained, thus reducing the dataset to 408 schools from around New Zealand. The focus of analysis will be on its subset of 91 Auckland schools, but the New Zealand dataset of 408 schools will be used to demonstrate how interactive techniques can be useful as the number of observations increases.
The focus of analysis will be on the Auckland subset because it is less affected by the unreliability of small sample sizes. The NZQA indicator of a “small” cohort was at a very low threshold of fewer than five candidates for any qualification level. Being the most populated city in New Zealand, Auckland has many of the larger schools, but there may still be some schools left in the analysis that have less than 30 candidates entered in a qualification level. The first few observations from the data set of 91 schools in the Auckland region are shown below.
## L1 L2 L3 UE Decile
## Al-Madinah School 0.889 1.000 1.000 0.640 2
## Albany Senior High School 0.905 0.904 0.882 0.701 10
## Alfriston College 0.659 0.651 0.563 0.369 2
A question that naturally arises from the NCEA data is whether a school’s performance is related to its decile rating.
The pairs plot in Figure 1 shows the achievement rates at L1, L2 and L3 have a weak relationship with decile rating, but the positive correlation is strong at UE. There appears to be an increasing “lower bound” to achievement rates for L1, L2 and L3, as decile increases, but there is a lot of scatter above this boundary. In the bivariate scatterplots we can also see the spread of achievement rates varying across decile groups. The variation in achievement rates decreases as the decile increases (from one), across all qualification levels. We can see many schools approaching the maximum 100% achievement rate, for L1, L2 and L3, hence it is not surprising to see their univariate distributions are skewed to the left in Figure 2. The distribution of achievement rates for UE is less skewed and hints at two possible groupings. Furthermore performance across the qualification levels appear to be positively correlated, especially between L1 and L2. Hence the low-dimensional plots indicate non-normality, unequal spread between groups and multicollinearity.
The pairs plot in Figure 2 provided a glimpse into the multivariate distribution of achievement rates across the four qualification levels. The parallel coordinates plot (PCP) shown below in Figure 3 allows us to further compare the multivariate distributions of achievement rates for different decile groups, as well as identify high dimensional clustering and outliers.
The ordering of axes in a PCP greatly affects the quality of the graphical analysis (Unwin, 2015). In the case of the NCEA data, the natural ordering of the four qualification levels by difficulty, conincides with the recommendation from Cook and Swayne (2007) to order the axes based correlation. In addition, Unwin (2015) highlights the layering of colours also needs to be considered carefully, since the last group assigned a colour will dominate the other lines. Furthermore interactivity that enables the reordering of axes is also recommended.
The positive relationships previously identified in the pairs plot, should translate to parallel lines between axes in the PCP, as opposed to a “criss-cross” pattern for negative correlation. The static plot in Figure 3 questions whether the positive relationships hold true for schools with low achievement rates and in some decile groups. The higher decile schools in Auckland appear to dominate the high achievement rates across all qualification levels, while lower decile schools are inconsistent with each other in terms of their performance across the levels. Although there are only 91 observations (lines), it is quite difficult to identify even “ball park boundaries”, on the 11-point decile scale, to distinguish between “higher” and “lower” decile schools when describing possible patterns.
The following two plots demonstrate how alpha blending can help minimise the effects of overplotting as the number of observations increase. It is easier to check whether the patterns identified in Auckland schools extend to the 408 schools across New Zealand, using Figure 5 where alpha blending is applied, rather than Figure 4. The performance of high achieving lower decile schools is less “drowned out” by the dominance of their higher decile counterparts, when alpha blending is used.
The use of colour and interactive techniques are recommended for maximising the effectiveness of a PCP. Venables and Ripley (2002) argue parallel coordinate plots are “often too ‘busy’ without means of interaction” (p. 315). Figure 6 demonstrates how adding interactive filtering helps to further reduce the problems with overplotting and allows direct comparison between selected decile groups.
One of the strengths of a PCP is in identifying multivariate features, such as outliers (Cook & Swayne, 2007). Figure 3 suggests a school’s performance can be unusual in two ways, either it performs inconsistently to the bivariate correlations between the qualification levels (this was previously noted as typical of “lower” decile schools), or its achievement rates across the levels are unusual compared to the rest of its decile group. The latter becomes difficult to see for the New Zealand dataset, in the static plots (Figures 4 and 5), even with alpha blending applied. The filtering feature of Figure 6 overcomes the problems caused by overplotting by allowing the user to isolate the distribution of each decile group (via a double-click on the legend). Furthermore the interactive tooltip feature enables the possible outliers to be identified by school name, or observation number, as shown in Figure 6.
The PCP in Figure 3 suggests that patterns of performance across decile groups, if they exist, are difficult to distinguish at the multivariate level for Auckland schools. Principal component analysis (PCA) provides a way to reveal interesting multivariate structure, through finding projections of the data that show maximal variability (Venables & Ripley, 2002).
A plot of the first two principal components of the Auckland schools dataset is shown in Figure 7. The decile of the schools is represented in the colour and plotting symbol. The axes for the original variables, reflecting the loadings of the principal components, indicate that the first principal component considers the schools’ performance across all four qualification levels, while the second principal component contrasts performance in UE against the remaining qualifications. Not surprisingly the plot shows more spread across the first principal component since it explains a much greater proportion of the variation in the data. The first principal component reveals a division between the majority of schools and a smaller group, that is positioned away from the variable axes shown. We can see the smaller group of schools are decile five and below, except for one decile nine school. There is also a decile ten school that appears to be unusual when examining both principal components. The use of colour highlights the remaining decile nine and ten schools as similar in performance, as weighted by the principal components. On the other hand schools of other deciles seem to be more spread out from each other, as well as the variable axes.
The principal components plot in Figure 7 provided a view of the multivariate distribution of the Auckland schools dataset that was less easily affected by overplotting than a static PCP. Hence refinements on observations were possible. From the parallel coordinate plots we observed that “higher” decile schools performed more consistently with each other and dominated across the four qualification levels, but it was difficult to quantify the decile “cut off” for such schools. The PCA suggests these are the decile nine and ten Auckland schools, with the exception of two schools. One cannot help but wonder: Which are the two unusual decile nine and ten Auckland schools? The interactive tooltips in Figure 8 instantly satifies this curiosity and hence demonstrates the usefulness of being able to directly identify unusual observations. The linked brushing between visual representations of the PCA model and the original variables, also enable confirmation of previous observations and a better understanding of the PCA model. + closer to variable axes means what? Anything? Better performance at that level? + link to decile plot as well as pcp or pairs plot of performance across the 4 qualifications.